DATA PREPARATION

This analysis shows that every feature except the "type" column takes many distinct values. Since "type" carries no information, it can be dropped.
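A minimal sketch of the drop, assuming the data is a pandas DataFrame (the toy frame and its values are illustrative, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real data; in the dataset "type" holds a single
# value for every row, so it carries no information.
df = pd.DataFrame({
    "type": ["event", "event", "event"],
    "user_action": ["visit", "order", "basket"],
})

assert df["type"].nunique() == 1  # constant column, no information
df = df.drop(columns=["type"])
```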

Feature Selection and Evaluation

time_stamp

The time_stamp column is divided into sub-level features.
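The split can be sketched with pandas' datetime accessors; the particular sub-levels shown (day, hour, weekday) are an assumption, since the source only says "sublevels":

```python
import pandas as pd

ts = pd.DataFrame({"time_stamp": ["2021-03-01 14:25:00", "2021-03-02 09:10:00"]})
ts["time_stamp"] = pd.to_datetime(ts["time_stamp"])

# Derive sub-level features from the raw timestamp.
ts["day"] = ts["time_stamp"].dt.day
ts["hour"] = ts["time_stamp"].dt.hour
ts["weekday"] = ts["time_stamp"].dt.weekday  # Monday = 0
```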

Categorical Features

All features except selling price should be treated as categorical:

user_action

product_gender

contentid

product_name

brand_id - brand_name

businessunit

category_id

Level1_Category_Id - Level1_Category_Name

Level2_Category_Id - Level2_Category_Name

Level3_Category_Id - Level3_Category_Name
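The cast can be sketched as below, using a toy frame with a subset of the listed columns (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_action": ["visit", "order"],
    "product_gender": ["Kadın", "Erkek"],
    "selling_price": [19.9, 45.0],
})

# Cast everything except selling_price to pandas' category dtype.
cat_cols = [c for c in df.columns if c != "selling_price"]
df[cat_cols] = df[cat_cols].astype("category")
```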

The code above should be run once for the train data and once for the test data to obtain both in the required format.

Both train and test data are saved in .csv and .xlsx formats.
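A sketch of the save step (the frame and filenames are illustrative; .xlsx export additionally requires the openpyxl package):

```python
import pandas as pd

# Stand-in for the prepared train data.
train = pd.DataFrame({"selling_price": [19.9, 45.0]})

train.to_csv("train.csv", index=False)
try:
    train.to_excel("train.xlsx", index=False)  # needs openpyxl
except ImportError:
    pass  # skip the .xlsx copy if openpyxl is not installed
```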

MODEL EVALUATIONS

PCA for numeric part

There are 84 numeric columns. PCA showed that 40 principal components explain 0.9754 of the variance, so the 84 numerical features are reduced to 40.
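The reduction can be sketched with scikit-learn; the random matrix stands in for the real 84 numeric columns, so the explained-variance ratio here will not match the 0.9754 reported above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 84))  # stand-in for the 84 numeric columns

pca = PCA(n_components=40).fit(X)
explained = pca.explained_variance_ratio_.sum()  # cumulative explained variance
X_reduced = pca.transform(X)                     # 84 -> 40 columns
```

In practice `n_components` is chosen by inspecting the cumulative `explained_variance_ratio_` until it crosses the desired threshold.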

CATEGORICAL VARIABLES

There are too many "most frequently actioned category" variables. To reduce their number, the columns and their meanings are checked:

freq_level3_cat_Şampuan -12

freq_level3_cat_Çorap -22

freq_level3_cat_Çay -14

freq_level3_cat_Köpek Ürünleri -11

freq_level3_cat_Kedi Ürünleri -22

freq_level2_cat_Pet Shop -35
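One way to reduce these columns, sketched here as an assumption since the source only says to check their meanings, is to fold the rare Level-3 pet columns into the Level-2 "Pet Shop" column:

```python
import pandas as pd

# Toy indicator columns (values are illustrative).
df = pd.DataFrame({
    "freq_level3_cat_Köpek Ürünleri": [1, 0],
    "freq_level3_cat_Kedi Ürünleri": [0, 1],
    "freq_level2_cat_Pet Shop": [0, 0],
})

# Merge the rare Level-3 pet columns into the broader Level-2 column.
pet_cols = ["freq_level3_cat_Köpek Ürünleri", "freq_level3_cat_Kedi Ürünleri"]
df["freq_level2_cat_Pet Shop"] = df["freq_level2_cat_Pet Shop"] | df[pet_cols].max(axis=1)
df = df.drop(columns=pet_cols)
```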

Combining numeric and categorical features

TRAIN

TEST
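Combining the two parts can be sketched with `pd.concat`; the column names and shapes are illustrative (40 PCA components plus encoded categoricals), and the same step is repeated for the test set:

```python
import numpy as np
import pandas as pd

# Hypothetical pieces: PCA-reduced numeric part and encoded categorical part.
num_df = pd.DataFrame(np.zeros((3, 40)), columns=[f"pc_{i}" for i in range(40)])
cat_df = pd.DataFrame({"user_action_order": [0, 1, 0]})

# Column-wise concatenation; both frames share the same row index.
train = pd.concat([num_df, cat_df], axis=1)
```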

Correlations of the Train Data features are checked again.
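A sketch of the re-check, flagging feature pairs whose absolute correlation exceeds a threshold (the 0.9 threshold and the toy columns are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})

corr = df.corr()
# Keep only the upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high = pairs[pairs.abs() > 0.9]  # candidates for removal
```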

Train & Eval Data Split
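A minimal split sketch; the split ratio and stratification are assumptions, since the source only names the step:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features
y = np.array([0] * 6 + [1] * 4)    # toy labels

# Stratified split so the class ratio is preserved in both parts.
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```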

Oversampling
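The source does not say which oversampler was used; below is a minimal random-oversampling sketch with scikit-learn's `resample` (imbalanced-learn's `RandomOverSampler` would be the usual alternative):

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced toy labels

# Upsample the minority class (with replacement) to match the majority count.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=8, random_state=42)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
```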

Random Forests with scoring="roc_auc"

Training with oversampled data
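A sketch of the tuned random forest; only `scoring="roc_auc"` comes from the source, while the grid values and toy data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy imbalanced classification data standing in for the prepared features.
X, y = make_classification(n_samples=200, n_features=10, weights=[0.8], random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
```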

Random Forests with scoring="balanced_accuracy"

Training with oversampled data

XGBoost for Classification - scoring="roc_auc"

with PCA

Oversampling

INCREASING MAX DEPTH AND DECREASING ETA

The best model came back with max_depth = 16, the largest value offered in the grid, so the depth range is extended upward while eta is decreased.

XGBoost for Classification - scoring="balanced_accuracy"

Training with oversampled data

Expanding the grid

Logistic Regression
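A minimal logistic-regression baseline, scored the same way as the tree models (the toy data and cv=3 are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=3)
```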

Model prediction was tried during the submission period.

Saving the models and loading a saved model
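Saving and reloading can be sketched with joblib, which ships alongside scikit-learn (the filename and fitted model here are illustrative):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

joblib.dump(model, "model.joblib")      # persist the fitted model
loaded = joblib.load("model.joblib")    # restore it later for prediction
```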